Efficient combinatorial optimization by quantum-inspired parallel annealing in analogue memristor crossbar

Combinatorial optimization problems are prevalent in various fields, but obtaining exact solutions remains challenging due to the combinatorial explosion with increasing problem size. Special-purpose hardware such as Ising machines, particularly memristor-based analog Ising machines, has emerged as a promising solution. However, existing simulated-annealing-based implementations have not fully exploited the inherent parallelism and analog storage/processing features of memristor crossbar arrays. This work proposes a quantum-inspired parallel annealing method that enables full parallelism and improves solution quality, resulting in significant speed and energy improvements when implemented in analog memristor crossbars. We experimentally solved tasks including unweighted and weighted Max-Cut and the traveling salesman problem using our integrated memristor chip. The quantum-inspired parallel annealing method implemented in memristor-based hardware demonstrates significant improvements in time and energy efficiency compared with previously reported simulated annealing and Ising machines implemented in other technologies. This is because our approach effectively exploits the natural parallelism, analog conductance states, and all-to-all connectivity provided by memristor technology, demonstrating its potential for solving complex optimization problems with greater efficiency.

This also implies that an arbitrary function can serve as the initial Hamiltonian as long as its ground state can be easily solved. According to our empirical observation, among these three functions, $\sum_i x_i^2$ works the best, so it is chosen as $H_{\mathrm{initial}}$ throughout the paper. Nonetheless, other well-designed functions could potentially enhance performance further, which requires future exploration. The case without STE performs as well as the case with STE when the total iteration number is small. However, with larger total iteration numbers, its performance decreases. This is because, without STE, the real binary spin is never involved during the solving process, and the analog spin proxy cannot accurately reflect the result of the binary spin, which causes a mismatch 1. Further updates of the analog spin proxy then lack the ability to lower the energy calculated with the binary spin. For the case without momentum, the system behaves like a greedy gradient-descent algorithm and the annealing fails: a local-minimum solution is easily obtained, but the global minimum can hardly be found. The role of momentum in the quantum-inspired annealing is not yet clear, but it might be related to the property of STE, since in the field of binary neural network (BiNN) training the performance of vanilla stochastic gradient descent is much poorer than that of optimizers with momentum 2, similar to our observation. For the case without momentum and without STE, the annealing works normally again, but the convergence speed is relatively slower than the case without momentum. For the case without clipping, the analog spin proxy has no limitation and can grow unboundedly in one direction during the solving process. If it then needs to move in the opposite direction later in the solving process, it takes a much longer time to accumulate gradient before its sign can change, which largely decreases the solving speed. STE, momentum, and clipping are all common techniques used in BiNN training. In fact, more advanced optimizers, such as Adam, and other sophisticated training methods are widely applied in the field of neural networks 3,4. These techniques have the potential to be applied to combinatorial optimization, but their applicability needs to be further researched due to the different nature of the target problem.
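To make the roles of STE, momentum, and clipping concrete, below is a minimal Python sketch of one update step of the analog spin proxy under these three techniques. It is an illustration only: the variable names, hyperparameters, and exact update rule are our assumptions and may differ from the implementation used in the paper.

import numpy as np

def qpa_step(x, v, J, c, lr=0.1, beta=0.9,
             use_ste=True, use_momentum=True, use_clip=True):
    # One illustrative update of the analog spin proxy x (length-N vector).
    # J is the (symmetric, zero-diagonal) Ising coupling matrix, c the
    # coefficient of the initial Hamiltonian (annealed towards 0 over time),
    # and v the momentum buffer.
    s = np.sign(x)
    s[s == 0] = 1                      # binary spins derived from the proxy
    # Gradient of H = -1/2 s^T J s + (c/2) * sum_i x_i^2.
    # With STE the coupling term is evaluated at the binarized spins;
    # without STE it is evaluated at the analog proxy itself.
    grad = -J @ (s if use_ste else x) + c * x
    if use_momentum:
        v = beta * v - lr * grad       # heavy-ball momentum update
        x = x + v
    else:
        x = x - lr * grad              # plain (greedy) gradient descent
    if use_clip:
        x = np.clip(x, -1.0, 1.0)      # keep the proxy bounded in [-1, 1]
    return x, v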

Figure S5. Flow chart of the proposed memristor-based Ising solver.
Steps in green boxes were conducted experimentally in the analogue memristor crossbar array, while steps in yellow boxes were computing steps in the digital domain, which can be integrated onto the chip in the future.

Figure S2. Comparison of different updating techniques on the unweighted Max-Cut problem. a Average success probability vs. total iteration number. b Average solution gap (defined as the difference between the solved cut value and the maximum cut value) vs. total iteration number. Five different updating techniques are compared: the original updating algorithm used in the paper (denoted as Baseline in the figure), one without discretizing the proxy spin value when calculating the gradient (STE), one without adding a momentum term, one without momentum and without STE, and one without clipping the proxy spin value to be between -1 and 1. For each data point in the plot, 20 60-node, 50%-density unweighted Max-Cut problem instances are solved. For each problem instance, 100 trials from random initial states are conducted. From the plot, it can be seen that all three techniques, STE, momentum, and clipping, are essential to the performance.

Figure S3. Experimental set-up and integrated memristor chip. a The overall experimental set-up. A PC is used to control the integrated memristor chip via a microcontroller. b The printed circuit board (PCB) developed to measure the chip. c Optical image of the integrated memristor chip, which consists of three 64×64 1T1M memristor arrays and peripheral circuits including transimpedance amplifiers (TIAs), multiplexers (MUXs), analog-to-digital converters (ADCs), and drivers. The CMOS peripheral circuits were taped out in TSMC's 180 nm technology, while the memristor devices were fabricated at the back end of line in an in-house cleanroom. d Optical image of the memristor crossbar array. e SEM image of the cross-point memristor device. f Lateral TEM image of the 50 nm×50 nm Pt/Ta/TaOx/Pt memristor device.

Figure S4. Performance of the memristor devices in our integrated hardware. a DC I-V measurement of 100 Set and Reset cycles demonstrating small cycle-to-cycle variations. The average value is shown by the dark line. b Linearity test of different conductance states, demonstrating excellent I-V linearity. c Retention test after dividing the entire 64×64 array into 16 subarrays, with each subarray programmed to a target conductance of 0 to 150 µS with a step size of 10 µS. The mean value is shown by the dark line, while the shaded area indicates the interquartile range. d Conductance tuning by a set of SET and RESET pulses. 256 SET pulses with amplitude linearly increasing from 0.6 V to 0.7 V were applied to the device, followed by 256 RESET pulses with amplitude linearly decreasing from -0.5 V to -0.65 V. The pulse width for both SET and RESET pulses is 10 ns.
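For clarity, the following short Python snippet generates the pulse-amplitude schedule described in panel d; the instrument-control calls are omitted and the names below are illustrative only.

import numpy as np

# Incremental SET/RESET sweep of panel d: 256 SET pulses ramping from
# 0.6 V to 0.7 V, followed by 256 RESET pulses ramping from -0.5 V to
# -0.65 V, each with a 10 ns pulse width.
set_amplitudes = np.linspace(0.6, 0.7, 256)        # volts
reset_amplitudes = np.linspace(-0.5, -0.65, 256)   # volts
pulse_width_s = 10e-9                              # 10 ns

pulse_train = [("SET", v, pulse_width_s) for v in set_amplitudes] + \
              [("RESET", v, pulse_width_s) for v in reset_amplitudes]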

Figure S6. The impact of the initial $c$ value and its relationship to the maximum eigenvalue of the Ising coupling matrix. a Ising coupling matrix. The maximum eigenvalue is 7.76. b Success probability of the problem instance in (a) vs. the initial $c$ value. The black dashed line marks where the initial $c$ value equals the maximum eigenvalue of the Ising coupling matrix. In our QPA, the coefficient $c$ of the initial Hamiltonian should start from a value large enough that the system is initiated with a function whose ground state can be easily found, i.e. usually a convex function. In our updating scheme, the gradient calculated with the binary spins $s_i$ is used as an estimator to update the analog proxy spins $x_i$. By replacing the binary spin configuration $\mathbf{s}$ in the Ising Hamiltonian with the analog spin configuration $\mathbf{x}$, we obtain a relaxed total Hamiltonian $H_{\mathrm{total}}$ (a sketch of one plausible form is given below). In this case, if $c - \lambda_{\max} \ge 0$, where $\lambda_{\max}$ is the maximum eigenvalue of the Ising coupling matrix $J$, then $H_{\mathrm{total}}$ becomes convex and has its ground state at $x_i = 0$ for all $i$. Therefore, to guarantee a convex function, $c$ should start from a value larger than the maximum eigenvalue of the Ising coupling matrix $J$. However, it cannot be too large, since an excessively large initial $c$ wastes more iterations and thus degrades performance. From the plot, it can be seen that the best performance is obtained where the initial $c$ value is slightly larger than the maximum eigenvalue of the Ising coupling matrix, which further supports our theory.
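For reference, the following is a minimal sketch of the relaxed Hamiltonian referred to above, assuming the quadratic initial Hamiltonian $\sum_i x_i^2$ and the symbols used in this caption; the exact prefactors and notation in the paper may differ:

\[ H_{\mathrm{total}}(\mathbf{x}) = -\frac{1}{2}\sum_{i \neq j} J_{ij}\, x_i x_j + \frac{c}{2}\sum_i x_i^2 . \]

Its Hessian is $cI - J$, so $H_{\mathrm{total}}$ is convex, with a unique minimum at $x_i = 0$ for all $i$, whenever $c \ge \lambda_{\max}(J)$.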

Figure S7. The impact of step size. a Success probability and b average solution gap vs. step size. For this plot, the target problem of Fig. 2c was solved with 100 trials. The dashed line shows the simulation result with our experimentally validated memristor model, which agrees well with the experimental data. From the plot, it can be seen that when the step size is relatively small, increasing it can lead to improved system performance, as each iteration progresses further in the direction of the gradient. However, once the step size becomes sufficiently large, the system will diverge. This aligns with experience in neural network training. In fact, more advanced adaptive step-size schemes are commonly implemented in the training of neural networks 3,5,6. The selection of optimal hyperparameters, including the step size and the coefficient of the momentum term, is a vast and intricate problem that requires further research in the future.

Figure S8. Comparison of the average solution gap (optimal maximum cut − obtained cut) of our QPA and SA. The experimental conditions for this plot are exactly the same as for the results shown in Fig. 2g of the main text.

Figure S9. Experimental vector-matrix multiplication results vs. software-calculated expected results. The gray dashed line indicates where the experimental result would perfectly match the expected result.
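As an illustration of the comparison plotted here, the snippet below computes the ideal vector-matrix product and its correlation with measured outputs; the array sizes, read voltages, and the placeholder "measured" data are our assumptions, since the actual readout comes from the chip's ADCs.

import numpy as np

# Illustrative accuracy check of analog vector-matrix multiplication (VMM).
# G is a programmed conductance matrix (in siemens) and v the applied voltage
# vector; in the real experiment, `measured` would be the ADC readout of the
# output currents, so here it is only a placeholder with added noise.
rng = np.random.default_rng(0)
G = rng.uniform(0, 150e-6, size=(64, 64))           # conductances, 0-150 uS
v = rng.choice([-0.2, 0.2], size=64)                # read voltages in volts
expected = G @ v                                    # ideal VMM output currents
measured = expected + rng.normal(0, 1e-6, size=64)  # placeholder measurement

r = np.corrcoef(measured, expected)[0, 1]
print(f"correlation between measured and expected VMM results: {r:.4f}")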

Figure S10. The evolution of intermediate values in the experimental solving process of the weighted Max-Cut problem. (a) The evolution of the classical "superposition" (analog spin proxy). (b) The evolution of the momentum. During the early stage of the solving process, the classical "superposition" rapidly oscillates between -1 and 1 due to the higher impact of the initial Hamiltonian, and eventually converges to -1 or 1 to match the binary spin configuration.

Figure S11. Exploration of the effect of conductance relaxation on performance, by simulation with the experiment-extracted model. (a) Experimentally measured relationship between the standard deviation of the conductance variation of the whole array and time. The measurement conditions are the same as in Fig. S4(c). As the retention time increases, the conductance variation of the memristor devices gradually grows. (b) Change of the average success probability with respect to the retention time. (c) Change of the normalized average solution gap with respect to the retention time. For (b) and (c), the results are obtained by simulation with the experimentally measured conductance variation in (a). It can be seen that, even after a long power-down period, the non-volatile property of the memristors enables our system to obtain acceptable solutions with limited performance degradation.
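A minimal sketch of how such conductance relaxation can be injected into a simulation is given below; the Gaussian-deviation model and function names are our assumptions, and the device model used in the paper may be more detailed.

import numpy as np

def relaxed_conductance(G_target, sigma_t, rng=None):
    # Illustrative relaxation model: each device drifts from its programmed
    # target conductance by zero-mean Gaussian noise whose standard deviation
    # sigma_t is taken from the measured retention curve in panel (a).
    rng = np.random.default_rng() if rng is None else rng
    G = G_target + rng.normal(0.0, sigma_t, size=G_target.shape)
    return np.clip(G, 0.0, None)   # conductance cannot go negative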

Figure S12. Success probability plots used for calculating the TTS of the weighted Max-Cut problem. a Success probability vs. problem size for simulated annealing. b Success probability vs. problem size for quantum-inspired annealing. The result for quantum-inspired parallel annealing is obtained by simulation with the experiment-validated memristor model.
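For reference, the time to solution is commonly computed from these success probabilities using the 99%-confidence convention sketched below; this is the standard definition, and the exact confidence level used in the paper may differ:

\[ \mathrm{TTS}_{99} = t_{\mathrm{run}} \cdot \frac{\ln(1 - 0.99)}{\ln(1 - p_{\mathrm{success}})}, \]

where $t_{\mathrm{run}}$ is the time of a single annealing run and $p_{\mathrm{success}}$ is the per-run success probability plotted above.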

Figure S13. Scaling trend for the unweighted Max-Cut problem. a Success probability vs. problem size for simulated annealing. b Success probability vs. problem size for quantum-inspired annealing. c Time to solution vs. problem size. For each problem size, 20 randomly generated problem instances are solved 1000 times with random initial spin configurations. The shaded area in the plot shows the interquartile range. The square marker indicates the median. The result for quantum-inspired parallel annealing is obtained by simulation with the experiment-validated memristor model.

Figure S14. Verification of the scaling trend for larger problems with defect-free simulation. (a) The result of solving N=800 and N=2000 problems with random graphs from a Max-Cut benchmark dataset, Gset (https://web.stanford.edu/~yyye/yyye/Gset/). For each problem instance, 1000 trials are conducted. The number in brackets denotes the success probability of finding the obtained best cut. (b) Further scaling of the trend in Fig. S12(c) to N=2000. One major challenge in benchmarking the performance on large problems is finding exact solutions for them, as brute-force search becomes impossible for problems with N>60 even with the most advanced digital hardware, due to the exponentially exploding nature of the problem. Thus, Gset problems drawn from a public problem set are used for benchmarking. It can be seen that the scaling trend for solving larger problems is basically consistent with the fitted curve found in Fig. S12(c), with some deviations. There are two major reasons for the deviation. Firstly, problem instances in the Gset are much sparser than the problems we solved in Fig. S12 (density=0.06 for N=800 and density=0.01 for N=2000, compared with density=0.5 for N=10-100) and are generally easier to solve according to our finding in Fig. 2(g) of the main text. Secondly, the simulation conducted in Fig. S12 considered the variation of the memristor device, while the simulation here is defect-free, which results in higher performance.
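As an aside, a Gset instance can be parsed and a cut value evaluated as in the short sketch below; the file format (a header line giving the number of nodes and edges, followed by one 1-indexed "i j w" line per edge) is the standard Gset convention, and the function names are ours.

import numpy as np

def load_gset(path):
    # Parse a Gset Max-Cut instance into a symmetric weight matrix.
    with open(path) as f:
        n, m = map(int, f.readline().split())
        W = np.zeros((n, n))
        for _ in range(m):
            i, j, w = map(int, f.readline().split())
            W[i - 1, j - 1] = W[j - 1, i - 1] = w
    return W

def cut_value(W, s):
    # Cut value of a +/-1 spin configuration s: total weight of edges whose
    # endpoints fall on different sides of the partition.
    return 0.25 * np.sum(W * (1 - np.outer(s, s)))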

Figure S15. Exploration of the effect of problem hardness with defect-free simulation. The success probability distribution of 1000 50%-density unweighted Max-Cut problem instances solved by (a)-(b) the original QPA with (a) 1000 iterations and (b) 10000 iterations; (c)-(d) QPA without discretizing the proxy spin value when calculating the gradient, with (c) 1000 iterations and (d) 10000 iterations; and (e)-(f) SA with (e) 10000 iterations and (f) 100000 iterations. For each problem instance, 100 randomly initialized trials are conducted to calculate the success probability. It can be seen that the histogram for QPA without STE is bimodal, meaning that it separates the problem instances into two groups, hard ones and easy ones. The algorithm almost always succeeds on easy problems and fails on hard problems. This kind of bimodal distribution is similar to the results obtained with the D-Wave quantum system and classical dynamics 7. However, for the original QPA and SA, the distribution is relatively unimodal. Further increasing the total iteration number drives the distribution of QPA without STE to become more bimodal (the number of both easy and hard problems increases, while the number of problems in the middle decreases) but does not significantly improve the overall success probability. However, for both the original QPA and SA, increased iterations cause the unimodal distribution to shift to the right, which indicates overall improved performance. The detailed mechanism behind the clear separation of hard and easy problems remains unclear. Here, we offer a preliminary intuitive explanation. For QPA without STE, the whole solving process is based on the evolution of the analog spin value, meaning that the system tends to evolve to the ground state in the "analog" solution space, which might deviate from the true ground state in the binary solution space for some problem instances. Increasing iterations improves the ability to find the ground state in the "analog" solution space but does not help with finding the true ground state of the binary spins. However, the introduction of STE in our QPA utilizes the binary spin value for the gradient calculation, which involves the true binary spins in the solving process, corrects the mismatch caused by the analog variables, and leads to the true ground state. This might also explain why our method performs better than those based on continuous-time dynamics, such as optical and electronic oscillator-based systems.

Figure S16. TSP instance and traveling path comparison. a The target problem instance solved in this paper. b The corresponding distance matrix between the cities, in units of km. The distance between any two cities is the shortest distance along the surface of the Earth, calculated from the coordinates of the city centers. c The spin-value matrix representation of the minimum-distance traveling path found by QPA. d The corresponding traveling path on the map. e The spin-value matrix representation of the minimum-distance traveling path found by SA. f The corresponding traveling path on the map. g The spin-value matrix representation of the minimum-distance traveling path found by DHNN. h The corresponding traveling path on the map.
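The great-circle distances in panel b can be reproduced with the standard haversine formula, as sketched below; the Earth radius and the use of city-center coordinates are our assumptions about the exact computation.

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Shortest distance along the Earth's surface between two points given
    # in degrees of latitude/longitude (haversine formula).
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlam = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2.0 * radius_km * np.arcsin(np.sqrt(a))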

Figure S17. Simulation results of solving larger problems in TSPLIB 8. a Result of solving the burma14 problem with 14 cities. b Result of solving the ulysses16 problem with 16 cities. c Result of solving the gr17 problem with 17 cities. "Best" in the plot represents the best-known solution provided by the library 8. For each problem, 5000 iterations are conducted for both QPA and SA, with 100 trials from random initial states.